DISCO+QR: rooting species trees in the presence of GDL and ILS

Abstract Motivation Genes evolve under processes such as gene duplication and loss (GDL), so that gene family trees are multi-copy, as well as incomplete lineage sorting (ILS); both processes produce gene trees that differ from the species tree. The estimation of species trees from sets of gene family trees is challenging, and the estimation of rooted species trees presents additional analytical challenges. Two of the methods developed for this problem are STRIDE, which roots species trees by considering GDL events, and Quintet Rooting (QR), which roots species trees by considering ILS. Results We present DISCO+QR, a new approach to rooting species trees that first uses DISCO to address GDL and then uses QR to perform rooting in the presence of ILS. DISCO+QR operates by taking the input gene family trees and decomposing them into single-copy trees using DISCO and then roots the given species tree using the information in the single-copy gene trees using QR. We show that the relative accuracy of STRIDE and DISCO+QR depend on the properties of the dataset (number of species, genes, rate of gene duplication, degree of ILS and gene tree estimation error), and that each provides advantages over the other under some conditions. Availability and implementation DISCO and QR are available in github. Supplementary information Supplementary data are available at Bioinformatics Advances online.


S1 Proof of Runtime Complexity of DISCO+QR
Given an unrooted tree T , we denote a rooted version of T by R e where R e is T rooted on edge e. If the rooted edge is not of significance, we omit e and simply write R.
Proposition S1.1. DISCO decomposes the input gene family trees into single copy trees in O(N n) time, where N is the total number of leaves in the gene family trees and n is the number of taxa (species labels) appearing in the gene trees.
Proof. The DISCO algorithm can be separated into three steps for each input gene-family tree G i . The first step, rooting (finding the most parsimonious root), can be done in O(n|L i |) time, where |L i | is the number of leaves in G i . This is briefly mentioned in the original algorithm description in [7] ("we used memoization to reduce the time to quadratic"). The second step, tagging, can be again done in O(n|L i |) time, which is clear from its algorithm in ASTRAL-Pro (Algorithm 1, [7]). The third step, decomposition, can be done in O(|L i |) time, which is self-evident from its pseudocode in the DISCO paper (Algorithm 1, [6]). The total running time of the DISCO decomposition step for all input gene family trees is The next two propositions give the runtime of the two steps of QR: preprocessing and scoring.
First, we bring a detailed description of the QR algorithm: Details of QR Algorithm. QR has a cost function Cost(R e | q , ⃗ u q ) that given ⃗ u q (i.e., ⃗ u q a 15dimensional vector), the empirical distribution of gene tree unrooted quintet topologies induced on q and a rooted quintet topology R e | q (denoting R rooted on e induced on q), calculates the "cost" of observing the distribution ⃗ u q assuming R e | q is the model rooted quintet topology. Given the set of sampled quintets Q, Score(e) is defined as q∈Q Cost(R e | q , ⃗ u q ). Our goal is to find the edge e that minimizes Score(e). Now we give the scoring algorithm.
1. Preprocess T . Build an empty list I(v) ⊆ Q at each internal node v of T . For each q ∈ Q, consider T | q . We extract the internal nodes x, y, z of T | q , where x, y, z are also internal nodes of T . Add q to I(x), I(y), and I(z). In effect, I(v) for any internal node v of T is defined by the quintets inside Q such that the homeomorphic subtrees of T induced on these quintets have v as an internal node.
2. Fix an arbitrary edge e. Calculate Score(e) by iterating over the quintets in Q.

3.
Consider an edge f adjacent to e connected by node v. We conceptually split Q into two sets. The first set Q δ is defined by those q ∈ Q s.t. R e | q ̸ = R f | q in topology, i.e., those quintets that have rooted topologies that differ when rerooting T from e to f . The second set Q ′ is Q − Q δ (quintets that don't differ in rooted topology when rerooting from e to f ). Observe that for any Further observe that Q δ = I(v), since by construction, for any q ∈ I(v), e and f are separated by an internal node of T | q that is v, hence R e | q and R f | q are different in rooted topology. Also, any q ̸ ∈ I(v) will be in Q ′ because e and f are located on the same edge of T | q (where an edge in T | q is a path in T ).
4. Finally, recall that we already calculated Score(e) for some arbitrary e. We proceed by traversing all edges of T starting from e, and when visiting an edge f , we derive Score(f ) using the above-mentioned update strategy. The above strategy will be called at most twice per internal node v because each v has bounded degree three.
We now bring the runtime complexity analysis of preprocessing and scoring steps.
Proposition S1.2. The preprocessing step of QR that computes the cost of each possible rooting of an unrooted quintet can be done in O(k(|Q| + n)), where k is the number of input single-copy gene trees and Q is the set of sampled quintets in QR.
Proof. Recall that the cost of a rooted quintet tree on taxon-set q is defined with respect to an empirical distribution of the unrooted quintet topologies of q observed in the gene trees denoted by ⃗ u q . Therefore we first obtain this ⃗ u q , after which the cost for a given rooted quintet tree can be computed in O(1) time.
For each single-copy gene tree G, we first equip G with an LCA data structure (see Lemma S1.4) in O(n) time, after which the unrooted quintet topology of any q can be queried on G in O(1) time (by Lemma S1.5). As such, each ⃗ u q can be computed in O(k) time, and all ⃗ u q s where q ∈ Q can be computed in O(|Q|k) time. Adding in the time to initialize the LCA data structures results in a total running time of O((|Q| + n)k).
Once the ⃗ u q s are computed, one can iterate over all seven possible rooted topologies of T | q , where T | q is the unrooted species tree induced on q, to compute the costs against ⃗ u q . Since there are a constant number of topologies per q, our total running time is unaffected and remains O((|Q| + n)k).
Proposition S1.3. The scoring step of QR can be done in O(n + |Q|) time for an unrooted species tree T with n taxa, where Q is the set of quintets sampled.
Proof. Given that R e | q can be queried in O(1) time according to Lemma S1.5, each ⃗ u q already calculated in the QR preprocessing step, and that Cost is a constant time function, Score(e) can be derived in O(|Q|) time for arbitrary e by simply iterating over all quintets in set Q. Now we compute the runtime for each step of the QR scoring algorithm: 1. Preprocessing T takes O(n) time according to Lemma S1.4, and computing I(v) for all internal nodes of T takes O(|Q|) time, as each q ∈ Q is analyzed in constant time. Therefore, this step takes O(n + |Q|) time.
2. Calculating Score(e) for one fixed edge e can be simply done by iterating over all the quintets in set Q. This takes O(|Q|) time.

3.
For an edge f connected to e by node v, updating and Q δ contains quintets q such that R e | q and R f | q are different in rooted topology and only the cost of these quintets should be updated in Score(f ), which can be done in constant time per quintet.
4. Each internal node of T will be traversed at most twice in the final step, and hence the scores of all edges can be computed in at most v 2 · (1 + O(|I(v)|)) time.
In this way, we obtained scores for all edges of T in at most v where v ranges over all internal nodes of T . Therefore, the total runtime of the scoring step of QR is O(n + |Q|).
Lemma S1.4. Given rooted tree R with l leaves, after O(l) preprocessing time, the following queries can be completed in O(1) time for arbitrary nodes u, v of R.
1. lca(u, v), the least common ancestor of u and v 2. depth(u), the distance (number of edges) between u and the root. If depth(u) > depth(v), u is said to be deeper (or lower) than v.
3. d(u, v), the number of edges between u and v 4. ancestor(u, v), which decides if v is an ancestor of u Proof. These are well known. (1) is due to [2]. (2) is trivial (preprocess by DFS from the root) and is frequently a byproduct of the preprocessing used for (1).
) and hence can be queried in constant time. For (4), u is an ancestor of v if lca(u, v) = u and hence can be queried in constant time.
We simply refer to this preprocessing of Lemma S1.4 as equipping R with the LCA data structure.
Lemma S1.5. Given rooted tree R (with T as its unrooted topology) equipped with the LCA data structure, the following queries can be completed in O(1) time given five taxa q = (q 1 , q 2 , q 3 , q 4 , q 5 ) and e an edge in R: 1. R| q , the homeomorphic (rooted) subtree of R induced on q 2. T | q , the homeomorphic subtree of T induced on q 3. R e | q , first reroot R on e, then consider the rooted quintet topology of R e induced on q Proof. We start with (1). Consider the set S of nodes in R inductively defined as follows. First, q i ∈ S where 1 ≤ i ≤ 5. Second, if u, v ∈ S, then lca(u, v) ∈ S. Observe that S is constant-size by construction and contains exactly all the nodes of R| q . Construct S in constant time, and for any u, v ∈ S, if u is an ancestor of v, put a directed edge from u to v, forming a graph H on S.
Compute the transitive reduction on H, which results in R| q . Notice that R| q is of constant size.
For (2), since R| q is of constant size, we can use brute force to extract the two bipartitions of the unrooted topology of R| q in constant time, which defines T | q .
For (3), first observe that given two nodes x, y in R and an edge e = (u, v) in T , it can now be decided in constant time if e lies on the path between x and y. A node z is between x, y iff d(x, z) + d(z, y) = d(x, y). Therefore, if u and v both lie between x, y, then e lies between x, y. First compute R| q . Then, among all edges of R| q , we need to find the edge x, y s.t. e is located on the path between x and y on T , which as we just established can be done in constant time per edge, and we have a constant number of edges. Then reroot R| q on this edge x, y on which e lies, obtaining our defined R e | q .

S2.5 Rooting (Clade Distance) Error, MGTE, AD
Mean gene tree estimation error (MGTE) and average distance (AD) between the model species tree and true gene trees were calculated using a script written by Erin. K. Molloy for computing normalized Robinson-Foulds (RF) distance, available at https://github.com/ekmolloy/fastmulrfs/blob/master/python-tools/compare two trees.py.
We measured the rooting error using average normalized clade distance (nCD) which is an extension of RF distance for rooted trees. We used the script for computing average normalized clade distance available at https://github.com/ytabatabaee/Quintet-Rooting/blob/main/scripts/clade distance.py which is extended and modified from the script written by Erin. K. Molloy for computing RF distance for unrooted trees.

S3 Additional Information about Datasets
To create the new simulated data for this study, we used SimPhy [3] with the following command: simphy To simulate gene tree estimation error, we needed to generate sequence data. We used INDELible [1] with GTR parameters taken from the FastMulRFS study [4]. We also used scripts from that study (

S4 Additional Results
This section has tables with means and medians for the results (given in figures) from the main paper.        Table S7: Means and medians for all model conditions from Figure 6. Error (nCD) of STRIDE and DISCO+QR on a simulated dataset with 1000 taxa (and one outgroup taxon) and varying numbers of genes. (a): 1000 genes, (b): 500 genes, (c): 100 genes. The species tree was estimated with ASTRID-DISCO. The simulated model conditions contain 10 replicates, a duplication rate of 5 × 10 −10 and an equal loss rate, AD ≈ 20%, and ≈ 40% MGTE.